Efficient Neural Audio Synthesis
Abstract
Sequential models achieve state-of-the-art results in audio, visual and textual domains with respect to both estimating the data distribution and generating high-quality samples. Efficient sampling for this class of models has however remained an elusive problem. With a focus on text-to-speech synthesis, we describe a set of general techniques for reducing sampling time while maintaining high output quality. We first describe a single-layer recurrent neural network, the WaveRNN, with a dual softmax layer that matches the quality of the state-of-the-art WaveNet model. The compact form of the network makes it possible to generate 24 kHz 16-bit audio 4× faster than real time on a GPU. Second, we apply a weight pruning technique to reduce the number of weights in the WaveRNN. We find that, for a constant number of parameters, large sparse networks perform better than small dense networks and this relationship holds for sparsity levels beyond 96%. The small number of weights in a Sparse WaveRNN makes it possible to sample high-fidelity audio on a mobile CPU in real time. Finally, we propose a new generation scheme based on subscaling that folds a long sequence into a batch of shorter sequences and allows one to generate multiple samples at once. The Subscale WaveRNN produces 16 samples per step without loss of quality and offers an orthogonal method for increasing sampling efficiency.
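Two of the ideas in the abstract reduce to simple index arithmetic, sketched below in Python. This is an illustrative sketch only, not the authors' implementation. It assumes that the dual softmax predicts a 16-bit sample as two 8-bit halves (a coarse high byte and a fine low byte), a detail taken from the full paper rather than from the abstract, and that subscaling with batch factor B places the samples at positions i, i+B, i+2B, ... in sub-sequence i. All function names are hypothetical, and the neural network and the conditioning between sub-sequences are omitted.

```python
from typing import List, Sequence, Tuple


def split_coarse_fine(sample: int) -> Tuple[int, int]:
    """Split an unsigned 16-bit sample into (coarse, fine) 8-bit halves.

    Assumption: each half is the target of one 256-way softmax.
    """
    if not 0 <= sample < 65536:
        raise ValueError("sample must fit in 16 unsigned bits")
    return sample >> 8, sample & 0xFF


def merge_coarse_fine(coarse: int, fine: int) -> int:
    """Inverse of split_coarse_fine: rebuild the 16-bit sample."""
    return (coarse << 8) | fine


def subscale_fold(seq: Sequence[int], batch: int) -> List[List[int]]:
    """Fold a sequence of length L into `batch` sub-sequences of length L / batch.

    Sub-sequence i holds the elements at positions i, i + batch, i + 2 * batch, ...
    """
    if batch <= 0 or len(seq) % batch != 0:
        raise ValueError("sequence length must be a positive multiple of batch")
    return [list(seq[i::batch]) for i in range(batch)]


def subscale_unfold(folded: Sequence[Sequence[int]]) -> List[int]:
    """Inverse of subscale_fold: interleave the sub-sequences back into one."""
    return [x for group in zip(*folded) for x in group]


if __name__ == "__main__":
    coarse, fine = split_coarse_fine(0xABCD)
    print(coarse, fine, merge_coarse_fine(coarse, fine))
    folded = subscale_fold(list(range(32)), 16)
    print(folded[0], folded[15])
```

With a batch factor of 16, the folded sequence has 16 rows, which matches the 16 samples per step quoted for the Subscale WaveRNN; how the rows are conditioned on one another is described in the paper and is not modeled here.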
Related resources
Musical Audio Synthesis Using Autoencoding Neural Nets
With an optimal network topology and tuning of hyperparameters, artificial neural networks (ANNs) may be trained to learn a mapping from low level audio features to one or more higher-level representations. Such artificial neural networks are commonly used in classification and regression settings to perform arbitrary tasks. In this work we suggest repurposing autoencoding neural networks as mu...
Synthesis of nickel ferrite nanoparticles as an efficient magnetic sorbent for removal of an azo-dye: Response surface methodology and neural network modeling
In this research, nickel ferrite (NiFe2O4) nanoparticles (NFNs) are prepared through coprecipitation method, and applied for adsorption removal of a model organic pollutant, methyl orange (MO). The characterization of t...
Speaker-independent 3D face synthesis driven by speech and text
In this study, a complete system that generates visual speech by synthesizing 3D face points has been implemented. The estimated face points drive MPEG-4 facial animation. This system is speaker independent and can be driven by audio or both audio and text. The synthesis of visual speech was realized by a codebook-based technique, which is trained with audio-visual data from a speaker. An audio...
Char2wav: End-to-end Speech Synthesis
We present Char2Wav, an end-to-end model for speech synthesis. Char2Wav has two components: a reader and a neural vocoder. The reader is an encoder-decoder model with attention. The encoder is a bidirectional recurrent neural network that accepts text or phonemes as inputs, while the decoder is a recurrent neural network (RNN) with attention that produces vocoder acoustic features. Neural vocode...
A Neural Network Principal Component Synthesizer for Expressive Control of Musical Sounds
This dissertation introduces a connectionist model that maps perceptual controllers to synthesis parameters to allow for an intuitive and powerful musical control of audio synthesis. This model, or system, allows the extraction, abstraction, reproduction and transformation of relevant features of a musician's style. All the information is deduced exclusively from audio. No prior knowledge of th...
Journal title:
- CoRR
Volume: abs/1802.08435
Pages: -
Publication date: 2018